1. Introduction


What is Power-lifting?

Powerlifting is the ultimate strength competition. The International Powerlifting Federation is the head of nearly 100 country federations world-wide. The powerlifts:- squat, bench press and deadlift are increasingly being recognized as principal exercises in the development of an individual’s true strength. But, analyzing the factors affecting the power lifts forms the core of this project.

Problem Statement:

Analyze the various factors affecting the ability of the person to do power-lifting. This project is all about understanding this exciting and intense sport better.

Key questions that will be answered in this project are -

  • How is the weight lifted by a person being impacted by age?
  • How is the weight lifted by a person being impacted by gender?
  • Is age impacting the equipment being used to do powerlifting?
  • Which section of people are more likely to get disqualified?
  • How is age impacting different types of lifts such as bench, deadlift and squat?
  • How is gender impacting different types of lifts such as bench, deadlift and squat?
  • Is there a relationship between bodyweight and position?
  • How is the gender split in different meets? Is it changing over the years?
  • What is the distribution of different age classes in the meets? How it varies over the years?
  • Is the participation increasing over the years?
  • Is the average weight lifted increasing over the years?

Solution Overview/ Approach:

We will be using a subset of the International Powerlifting Data in this project. Once we are done with the tedious but necessary task of data cleaning, we plan to perform univariate/ multi-variate analysis of the various attributes affecting the performance of the individual.We will also look at scatter plots and histograms broken down by multiple filters to be able to visually comprehend the influence of different attributes on the performance. We will also perform hypothesis testing to see if few of the relationships are statistically significant.

Finally, the relation will be explained with the help of a regression model which will give us a mathematical equation of the relationship between independent variables such as age, gender, body weight with the response variable - “ability to lift” across different divisions and world wide meets.

In short - this report aims to equip the consumer with the ability to guesstimate and be able to explain the lifting capacity of an individual, when provided with certain basic data points such as gender, age, bodyweight etc.

3. Data Preparation


This section talks about the various steps involved from importing the data set to the cleaning of data set. Each step of this process has been explained and the corresponding code has been shown. An option has been provided to hide the code if the reader wishes to skip reading the code. All the tables have been provided with a HTML scrollable format of the data. This will help us filter and check only the columns of interest using the option “Column Visibility” in addition to the “Search bar” given on top of the table.

At the end of this step, the data set will be ready to use for the analysis.

3.1 Data Source and Explanation

The data set that is being used in this project can be found on the github page: IPF Cleaned Dataset (Note - This link works only when opened in a new window)

The original source of the data set which is a much larger dataset can be found at: Open Powerlifting

This data set aims to create an accessible, accurate and open archive of the world’s powerlifting data. The original dataset is being updated every month, but the data set we are using for this project is last collected on September 20th 2019 and consists of data until August 26th 2019.

The dataset we are using in this project contains 41,152 observations with 16 variables having information on player’s gender, age, weight, lifting capacity, event participated and the date of the meet. We notice that the missing values have been recorded as "NA". Some data entries to call out are negative values in the weight lifted variables which indicate failed attempts. The dataset also contains information on people who were disqualified and guest lifters i.e. people who succeeded but aren’t eligible for awards.

Our preliminary observation suggests there might be few data entry errors, outlier values and irrelavant columns for the analysis, which have been dealt with in the subsequent data cleaning steps.

3.2 Importing Dataset and Preliminary Checks

First, let us start with the important step of importing our dataset -

## 'data.frame':    41152 obs. of  16 variables:
##  $ name            : chr  "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
##  $ sex             : chr  "M" "M" "M" "M" ...
##  $ event           : chr  "SBD" "SBD" "SBD" "SBD" ...
##  $ equipment       : chr  "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
##  $ age             : num  NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
##  $ age_class       : chr  NA "24-34" "35-39" "20-23" ...
##  $ division        : chr  NA NA NA NA ...
##  $ bodyweight_kg   : num  67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
##  $ weight_class_kg : chr  "67.5" "67.5" "67.5" "67.5" ...
##  $ best3squat_kg   : num  205 225 245 195 240 ...
##  $ best3bench_kg   : num  140 132 158 110 140 ...
##  $ best3deadlift_kg: num  225 235 270 240 215 230 235 335 310 295 ...
##  $ place           : chr  "1" "2" "3" "4" ...
##  $ date            : chr  "1985-08-03" "1985-08-03" "1985-08-03" "1985-08-03" ...
##  $ federation      : chr  "IPF" "IPF" "IPF" "IPF" ...
##  $ meet_name       : chr  "World Games" "World Games" "World Games" "World Games" ...

From the structure, we notice that the dataset has 41,152 rows and 16 columns - the dataset has been imported correctly. We see that the column names are appropriate and have suitable datatype assigned except for the column “date”, which is in character format.

We change the class of Date variable to date format

Checking the structure again to verify if the command worked and new NA’s are not introduced.

## 'data.frame':    41152 obs. of  16 variables:
##  $ name            : chr  "Hiroyuki Isagawa" "David Mannering" "Eddy Pengelly" "Nanda Talambanua" ...
##  $ sex             : chr  "M" "M" "M" "M" ...
##  $ event           : chr  "SBD" "SBD" "SBD" "SBD" ...
##  $ equipment       : chr  "Single-ply" "Single-ply" "Single-ply" "Single-ply" ...
##  $ age             : num  NA 24 35.5 19.5 NA NA 32.5 31.5 NA NA ...
##  $ age_class       : chr  NA "24-34" "35-39" "20-23" ...
##  $ division        : chr  NA NA NA NA ...
##  $ bodyweight_kg   : num  67.5 67.5 67.5 67.5 67.5 67.5 67.5 90 90 90 ...
##  $ weight_class_kg : chr  "67.5" "67.5" "67.5" "67.5" ...
##  $ best3squat_kg   : num  205 225 245 195 240 ...
##  $ best3bench_kg   : num  140 132 158 110 140 ...
##  $ best3deadlift_kg: num  225 235 270 240 215 230 235 335 310 295 ...
##  $ place           : chr  "1" "2" "3" "4" ...
##  $ date            : Date, format: "1985-08-03" "1985-08-03" ...
##  $ federation      : chr  "IPF" "IPF" "IPF" "IPF" ...
##  $ meet_name       : chr  "World Games" "World Games" "World Games" "World Games" ...
##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division    bodyweight_kg 
##             2906             2884              627              187 
##  weight_class_kg    best3squat_kg    best3bench_kg best3deadlift_kg 
##                1            13698             2462            14028 
##            place             date       federation        meet_name 
##                0                0                0                0

Let us see if the data is unique at this level -

As there is no change in the number of observations, the data is unique at this level.

Let us also rename the column ‘place’ to ‘position’ to make it more intuitive. Also we are getting rid of the suffix “kg” for few columns-

##  [1] "name"             "sex"              "event"            "equipment"       
##  [5] "age"              "age_class"        "division"         "bodyweight_kg"   
##  [9] "weight_class_kg"  "best3squat_kg"    "best3bench_kg"    "best3deadlift_kg"
## [13] "place"            "date"             "federation"       "meet_name"
##  [1] "name"          "sex"           "event"         "equipment"    
##  [5] "age"           "age_class"     "division"      "bodyweight"   
##  [9] "weight_class"  "best3squat"    "best3bench"    "best3deadlift"
## [13] "position"      "meet_date"     "federation"    "meet_name"

Now, we have fixed the datatype and column names - Let’s move ahead!

We will remove the column ‘federation’ as it has only one unique value and would not add any value to the analysis

## [1] 41152    15

Among the columns, age and age_class, we might only use either of them for our analysis. Also, as age groups are randomly distributed in this column, we might not use this column and might just stick to using the age column but let us not remove any of the columns now, which might turn out useful for the analysis later.

One possible way the age_class column would be helpful is to look at the distributions across multiple classes as this has only 16 unique values, we will retain this column for now.

##  [1] NA       "24-34"  "35-39"  "20-23"  "40-44"  "45-49"  "16-17"  "18-19" 
##  [9] "50-54"  "13-15"  "55-59"  "60-64"  "65-69"  "70-74"  "75-79"  "80-999"
## [17] "5-12"

We see that one particluar class is wrongly named - so renaming age class of 80-999 to 80-99, but however as all age groups are randomly distributed, we might not use this column and might stick to just using the age column.

##  [1] NA      "24-34" "35-39" "20-23" "40-44" "45-49" "16-17" "18-19" "50-54"
## [10] "13-15" "55-59" "60-64" "65-69" "70-74" "75-79" "80-99" "5-12"

Let us check for any similar outlier in the weight_class column:

##  [1] "67.5"  "90"    "90+"   "52"    "56"    "60"    "75"    "82.5"  "100"  
## [10] "110"   "125"   "125+"  "44"    "48"    "67.5+" "100+"  "57"    "63"   
## [19] "72"    "84"    "84+"   "74"    "83"    "105"   "120"   "120+"  "66"   
## [28] "105+"  "72+"   "53"    "59"    "93"    "43"    "47"    "110+"  "82.5+"
## [37] NA      "40"    "75+"

For Weight class column, the groups are again randomly distributed with multiple classes and this column might not be useful when we try to correlate the amount of weight lifted with the weight column. However, we will try to standardize the weight class column as it has too many classes. Let us create a new column with equal spaced buckets.

##  [1] "1-10"    "10-20"   "20-30"   "30-40"   "40-50"   "50-60"   "60-70"  
##  [8] "70-80"   "80-90"   "90-100"  "100-110" "110-120" "120-130" "130-140"
## [15] "140-150" "150-160" "160-170" "170-180" "180-190" "190-200" "200-210"
## [22] "210-220" "220-230" "230-240" "240+"
##  [1] 60-70   90-100  <NA>    50-60   40-50   70-80   80-90   100-110 120-130
## [10] 110-120 130-140 140-150 150-160 160-170 170-180 190-200 180-190 30-40  
## [19] 200-210 210-220 240+   
## 25 Levels: 1-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 ... 240+

Now, we will look at the summary statistics of the numeric variables. This will help us understand how the values are distributed in these columns:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.50   22.50   31.50   34.77   45.00   93.50    2906
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   37.29   60.00   75.55   81.15   97.30  240.00     187
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -210.0   160.0   215.0   217.6   270.0   490.0   13698
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -160.0    97.5   140.0   144.7   185.0   415.0    2462
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -215.0   170.0   222.5   221.8   270.0   420.0   14028

Let us also check for missing values before we determine what our next steps should be -

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2906             2884              627              187 
##     weight_class       best3squat       best3bench    best3deadlift 
##                1            13698             2462            14028 
##         position        meet_date        meet_name bodyweight_class 
##                0                0                0              187

Having looked at the numerical summaries, we see that the data does have both extreme values and missing values. We will have to clean this dataset further, so let us deal with each of these variables separately in the next step of the data cleaning process.

3.3 Missing Value and Outlier Treatment

We observe that there are missing values in multiple columns, we decide to clean the data by selecting each of the column and check for outliers and missing values.

For Columns - weight_class and bodyweight

As Weight Class column has only one missing value, we remove it as it doesn’t add any value.

##  [1] "67.5"  "90"    "90+"   "52"    "56"    "60"    "75"    "82.5"  "100"  
## [10] "110"   "125"   "125+"  "44"    "48"    "67.5+" "100+"  "57"    "63"   
## [19] "72"    "84"    "84+"   "74"    "83"    "105"   "120"   "120+"  "66"   
## [28] "105+"  "72+"   "53"    "59"    "93"    "43"    "47"    "110+"  "82.5+"
## [37] "40"    "75+"

Missing value for the Weight Class column has been removed.

We can impute for missing values of bodyweight from the column weight class but again, this wouldn’t make much sense as the weight class is a range and the tradeoff is making data available for just 187 rows (~0.5% of data) that too with error after a lot data computation. So, we just go ahead and remove these rows.

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2830             2808              613                0 
##     weight_class       best3squat       best3bench    best3deadlift 
##                0            13663             2452            13990 
##         position        meet_date        meet_name bodyweight_class 
##                0                0                0                0

We also see that the missing values for Body Weight column have been removed. Also, as we do not get any additional information from the old column - “weight_class”, So we just get rid of this column and see if the dataset is unique at this level.

##  [1] "name"             "sex"              "event"            "equipment"       
##  [5] "age"              "age_class"        "division"         "bodyweight"      
##  [9] "weight_class"     "best3squat"       "best3bench"       "best3deadlift"   
## [13] "position"         "meet_date"        "meet_name"        "bodyweight_class"

Let us check for outliers in the column bodyweight now,

As We observe that there are lot of outliers in the higher quartile,let us follow the standard 1.5(IQR) approach to deal with the outliers

(outlier_bodyweight <- quantile(ipf$bodyweight, c(0.75), na.rm = TRUE) + 
    (1.5 * IQR(ipf$bodyweight, na.rm = TRUE)))

We check for outliers in the body weight column above 153.25 to see the number of rows that have the value of 153.25 and above

We observe that there are 410 observations. We decide to go ahead and remove these observations

For Columns - age and age class

We will check if we can derive the missing values from either of the 2 columns age and age class.

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2823             2801              610                0 
##       best3squat       best3bench    best3deadlift         position 
##            13476             2405            13802                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

We observe that there are 2823 and 2801 missing values in the columns age and age class.

## [1] 0.5
##   age age_class
## 1 0.5      <NA>

From the row with value equal to 0.5, we cannot infer anything meaningful i.e., we will not be able to impute the missing values from this column and this could a data entry error.

Now, as we cannot infer the value of age from this, it doesn’t make sense to impute these values. So we decide leave these rows as it is and We will not remove these rows as we would lose ~7% of the data if we do so, also we can use the information from other columns for these rows.

## [1] "numeric"

Now, let us look at the outliers for the column age.

We notice that there are lot of outliers in the extremities of higher scale.

## 75% 
##  80

We check for outliers above the age of 80 in the age column and let us see how many rows have a value of 80 and above.

We decide to remove these rows where the age is greater than 80 - as we only have 41 observations. Also we get rid of the observation, where age is 0.5, which seems to be a clear anomaly. Also, we will get rid of the row in which the age class is mentioned as “5-12” as it has only one record.

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2823             2800              610                0 
##       best3squat       best3bench    best3deadlift         position 
##            13445             2404            13773                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

For the division column, there are 610 missing values and we can impute them if the distribution of age within classes is consistent. This doesn’t seem to be the case here and hence these values cannot be imputed.

However, We will not remove the missing values now as we are not sure if this variable will be used in the analysis. Also, considering the number of NA (~1.5% of the total data), we decided to leave them as it is. Another major reason is that these values belong to the time period between 1985-1994, which could be useful information from other columns while doing the analysis.

Also, let us look at the unique values and count for each of the type of event.

## 
##     B    SB   SBD 
## 12343     2 28166

As we have only 2 observations for SB, we can and hence can remove these observations. The rationale for removing is that 2 is a very small number to understand pattern or draw insights from.

Now, we are left with 3 numeric columns - best3squat, best3bench and best3deadlift.

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2823             2800              610                0 
##       best3squat       best3bench    best3deadlift         position 
##            13445             2404            13771                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

Although we have approximately 13,500 and 13,800 missing values for squat and deadlift, we are not really concerned about them as a majority of them fall under a particuar event “B”, where they are not supposed to have these values.This is shown below

##             name              sex            event        equipment 
##                0                0                0                0 
##              age        age_class         division       bodyweight 
##             2519             2498              327                0 
##       best3squat       best3bench    best3deadlift         position 
##             1102             1349             1428                0 
##        meet_date        meet_name bodyweight_class 
##                0                0                0

Now, We observe that the resulting dataset with the event type as “SBD” has approximately 1100 to 1500 missing values as compared to the previously seen ~13k missing values. So, we subset the dataset into 2 datasets, one for Bench and one for SBD

Now, we will have to look at the 3 variables - Bench, Deadlift and Squat for treatment.

However, only the variable bench for the dataset ipf_Bench,

##   75% 
## 347.5

We observe that there are outliers above the best3bench of 347.5. Let us see how many rows have value above 347.5.

We decide to remove these rows where the value is above 347.5 as we have 22 observations only.

We Check for outliers for negative values and observed that we do not have any.

Now, let us proceed to look at the other dataset where we have observations for all the 3 variables - Squat, Deadlift and Bench i.e. we will look at the dataset ipf_SBD.

For bench in event SBD,

We observe that there are outliers at both the extremities. Although, we have a reason for the existence of negative values, we go ahead and remove them as they are supposed to be rare events with very limited number of observations for us to draw any conclusions or insights.

##   75% 
## 292.5
##   25% 
## -27.5

We see outliers above 292.5 and below -27.5, let us see check for number of rows having values above 292.5 and below -27.5.

There are 151 observations and we decide to remove these rows where the value is above 292.5 and below -27.5.

For squat in event SBD,

Again, we notice that there are outliers on both positive and negative ends.

##   75% 
## 422.5
## 25% 
## 2.5

Here, we observe outliers above 422.5 and below 2.5, let us check for the number of rows that have values above 422.5 and below 2.5.

There are 25 observations and we decide to remove these observations.

For deadlift in event SBD,

We notice that there are outliers only in the lower extremity. However, let us check on both extremities for outliers and also the number of rows that have value above 420 and below 20.

## 75% 
## 420
## 25% 
##  20

There are 3 observations and We decide to get rid of these observations.

As we evaluated the event types separately to avoid removal of values that are not outliers within a specific group, we will now combine them to form the final dataset and reorder the columns.

Performing few sanity checks on the combined dataset and displaying the head for the final dataset we will be using for the analysis

Finally, after cleaning the dataset, we are left with a dataframe of dimensions 40308, 15

Let us look at the summary statistics for the numerical variables:

age: The minimum and maximum values are 13.5 and 80 after removing ~40 outlier observations. The descriptive statistics are shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.50   22.50   31.50   34.78   45.50   80.00    2822

bodyweight: The minimum and maximum values are 37.29 and 94.80 after removing ~410 outlier observations. The descriptive statistics are column is shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.29   60.00   75.00   80.10   94.80  153.24

best3squat: The minimum and maximum values are 25 and 422.5 after removing ~25 outlier observations. The descriptive statistics are column is shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0   160.0   212.5   215.0   265.0   422.5   13415

best3bench: The minimum and maximum values are 25 and 347.5 after removing ~170 outlier observations. The descriptive statistics are column is shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0    95.0   140.0   142.7   182.5   347.5    2398

best3deadlift: The minimum and maximum values are 25 and 420 after removing ~3 outlier observations. The descriptive statistics are column is shown below:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    25.0   170.0   220.0   220.3   270.0   420.0   13739

From the summary statistics of the numeric columns, we observe that the distribution of the data points for the variables is not skewed as the difference between Mean and Median is minimal.

3.4 Final summary of the Dataset

A few comments on the dataset -

  • We have removed outliers and treated NA’s to a major extent

    • There might still be some outliers but due to limited context on data, we refrain from removing these values
  • For the column - “age”, we could have derived some of the NA values from the column “age_class”

    • However, this involves a series of data transformations and we observed that ~10% of NA values will be fixed despite this
    • As we already have ~37k values, these NA values for age should not cause us any trouble
    • So, we do not perform any treatment as the trade off is low for the amount of computation we perform
  • Although we will not be using the “name” column in our analysis - we will retain this column in the main dataset due to the following reason

    • During the course of analysis, when we subset the data and come across any interesting finding which is particular to a small sub-set of the population, this column will help us identify these observations

Data Dictionary:

variable_name data_type description
name character Individual lifter name
sex character Binary gender (M/F)
event character The type of competition that the lifter entered - unique values are SBD(Squat Bench Deadlift) and B(Bench)
equipment character The equipment category under which the lifts were performed. Values such as Raw (Bare Knees or Sleeves), Wraps(with knee wraps) and Single-ply (single-ply suits)
age double The age of the lifter on the start date of the meet, if known.
age_class character The age class in which the filter falls, for example 40-45
division character Text describing the division of competition, like Open or Juniors 20-23 or Professional.
bodyweight double The recorded bodyweight of the lifter at the time of competition, to two decimal places.
bodyweight_class character The weight class in which the lifter competed, to two decimal places. Max specified by number and min specified by the symbol “+” to the right
best3squat double Maximum of the first three successful attempts for the lift - squat.
best3bench double Maximum of the first three successful attempts for the lift - bench.
best3deadlift double Maximum of the first three successful attempts for the lift - deadlift.
position character The recorded place of the lifter in the given division at the end of the meet. Special Values: G - Guest Lifter, DQ - Disqualified, DD - Doping Disqualification
meet_date date The date of the meet
meet_name character The name of the meet without the year. Can be seen in conjuction with the date column

4. Exploratory Data Analysis


We have structured this section into 3 parts -

  • The first part deals with distribution of the variables and check for any exceptions stand out after our data cleaning step
  • Next, we look at how different variables such as age, bodyweight, sex affect powerlifting. Each of the variables are visualized at different granularities. Also, each of the visualization is followed by a note on the findings from the graph.
  • In the final step, we built a model which shows the mathematical relation of how different variables affect the total weight lifted by the powerlifter. This relation can be used to predict the powerlifting capacity of the powerlifter with ~75% accuracy.

4.1 Variable Distributions

Firstly, let us look at the distributions of the numeric variables: Age, Bodyweight, Best3squat, Best3bench & Best3deadlift.

Distribution for Age Variable :-

We notice a right skewed distribution and most of the participation instances are by people of age ~20.

Distribution for Bodyweight Variable :-

A majority of participants have their weights between 40 and 100.

Distribution for Best3squat Variable :-

Most of them lift between 100kgs and 300 kgs in the squat category and the distribution closely resembles that of a normal distribution.

Distribution for Best3bench Variable.

Most of them lift between 70kgs and 200 kgs in the bench press category and the distribution is slightly skewed towards the right.

Distribution for Best3deadlift Variable.

Most of them lift between 120kgs and 320 kgs in the deadlift category and the distribution closely resembles that of a normal distribution.

Character Variables:-

A quick summary of the few important character variables tells us that we almost have double the number of observations for males as compared to females. This is the same case with equipment where SBD has ~2X number of observations as comapred to the event B.

We also notice that the equipment “single ply” has the most available information while Wraps has too less information for analysis.

4.2 Understanding Behavior of Variables

We will center our EDA on the distribution of variables such as age, bodyweight, equipment and how they impact the weight lifted by individuals.

Let us analyze how the factors like Age and Sex are affecting the weight lifting capacity.

For this, we decided to subset the dataset by event “SBD” and create a new variable for the total weight lifted in the three events i.e., Bench, Squat and Deadlift after replacing the missing values with zero.

Let us now plot a scatterplot to look at the impact of age on total weight lifted.

From the above Scatterplot, we observe that the weight lifting capacity is higher among the age groups between 20 to 40 and then, we can observe a downward trend from there on. Though there’s downward trend with the increase in age, it’s interesting to see that people lying in the age group 60-80 are still able to lift weights of ~600 kgs too.

Let us now do a scatterplot to look at the impact of age on different types of lift - best3squat, best3bench & best3deadlift.

The decline in the lifting capacity is almost same across all 3 types of lifts. However, squat shows a sharper decline in the lifting capacity after the age of 40 as compared to the gradual decline in the other 2 types of lifts.

Now, let us use a scatterplot to look at the variance in weight lifting capacities between male and female powerlifters

We can observe that the weight lifting capacities are higher in male powerlifters than female powerlifters. Especially, when we compare male and female powerlifters for age group between 20 and 40, the total weight lifted is around 1000 kgs for males and 700 kgs for females.

One interesting thing is that we see a downward trend in weight lifting capacities for females starting at the age of 30 yrs and for males it starts slightly later at the age of 35 yrs. The powerlifters from both the genders are able to lift reasonably higher weights even after an age of 60 yrs.

Now, let us understand how the lifting capacity varies by different lifts.

The weight lifted in squat seems to be the most impacted by gender in terms of the weight lifted, although there is a very big impact of gender on the weight lifted overall. Also the weight lifted starts to decline with age faster for females as compared to males by a small margin, this is more evident in the cases of deadlift and squat.

Let us observe if age is related to the equipment usage in powerlifting. For this, let us look at the distribution of age classes for each of the equipments being used.

From the above bar plot, we see that the distribution of age classes remains almost constant across the equipment types - Raw and Single-ply.

An intersting thing is that people between ages 16 to 19 prefer to perform lifts with bare knees or knee sleeves.

There is no indication that the usage of equipment is being impacted by the persons age as the distribution of age classes appears to be constant across different equipment types. We do not have enough information to comment on the distribution of age classes using wraps.

Let us now look at how using bodyweight is impacting the total weight lifted using different equipments.

From the above plots, we can clearly see that there is an increase in the total weight lifted using single-ply equipment when compared to raw weight lifting. This suggests a strong correlation between total weight lifting capacity and equipment usage.

Another intersting observation is that the increase in total weight lifted is almost negligible when body weight increases beyond 90kgs. Further, We will try to statistically prove whether there is strong relation between total weight lifted and equipment usage through hypothesis testing.

Let us create two subsets, one for single-ply equipment weight lifting and the other for raw weight lifting. Let us perform Welch Two Sample test to prove that the difference between the means for two samples is greater than zero.

## 
##  Welch Two Sample t-test
## 
## data:  ipf_singleply$totalweight and ipf_raw$totalweight
## t = 27.871, df = 8997.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  69.24364      Inf
## sample estimates:
## mean of x mean of y 
##  572.6758  499.0888

From the results of the tests, we notice that alternative hypothesis i.e., the difference between the means for two samples, using single-ply and raw weight lifting, is greater than zero, is true. We can conclude that the average total weight lifted is higher using an equipment like single-ply when compared to raw weight lifting.


Let us now plot a barplot to observe the distribution of usage of equipment among male and female weight lifters.

From the above barplot, we observe that the lifters of both the genders prefer to use single-ply equipment more than raw. We can notice that there is only a slight difference in percentage of usage of single-ply equipment between males and females.

Further, Let us also look at the impact of equipment usage on weight lifting capacity stands true for both the genders. We will perform hypothesis tests to prove that the average total weight lifted using single-ply equipment is greater than raw weight lifting for both male and female lifters.

## 
##  Welch Two Sample t-test
## 
## data:  ipf_singleply_male$totalweight and ipf_raw_male$totalweight
## t = 18.182, df = 5358.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  45.3992     Inf
## sample estimates:
## mean of x mean of y 
##  653.4089  603.4932

From the above results, we see that the usage of single ply equipment by males increases the amount of weight lifted by them, which is in line with the overall population.

## 
##  Welch Two Sample t-test
## 
## data:  ipf_singleply_female$totalweight and ipf_raw_female$totalweight
## t = 24.825, df = 4931.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  48.42398      Inf
## sample estimates:
## mean of x mean of y 
##  401.5128  349.6519

We see similar behavior among females as well. There are no surprises here.


Let us try to profile which section of population tend to get disqualified from various meets. We will start off by looking at the gender -

We see that instances of males being disqualified are 4 times more likely as compared to females. Let us see if the number of males being disqualified is being driven by a certain age group.

A major chunk of disqualification instances occur in the age group 24 - 34 followed by the age group 20 - 23.

Now, let us see if they happen in the same meet or distributed between different meets.

A lot of disqualifications occur in “World bench press championship” followed by worlds masters powerlifting championship and Men’s world powerlifting championship.

Now let us see if the bodyweight or age of the person impacts the final position of that person in the meet. We will look at the correlation values to check this.

## [1] 0.08960862
## [1] -0.1488544

We only consider the numeric values of the position variable to check if there is any correlation between the position and bodyweight of the person. The correlation values are very low which suggests that there is no direct impact of bodyweight on the position.

Let us now observe the distribution of weight lifters by sex across various meets.

Males participation is significantly greater than females participation in meets such as “world disabled bench press championship” and “world masters powerlifting championship”. One interesting thing is that there is higher male participation in all the bench press championships.

Let us add a year,month and date column which will help us look at trends over time.

Let us see if participation by gender is varying over time. As there are multiple meets - we will look at the top 3 meets by participation

meet_name count
World Bench Press Championships 7036
World Masters Powerlifting Championships 5936
World Classic Powerlifting Championships 3903

We looked at the top 3 meets by participation and below are the findings -

  • World Bench Press Championships - The male participation percentages are highly fluctuating and the male participation continues to be 60% or higher for most of the years starting from 1990.

  • World Classic Powerlifting Championships - There is a visible increase in women participation from 2013.

  • World Masters Bench Press Championships - The male participation is on the higher scale from 2000, with a slight improvement being shown in the last 3 to 4 years.

Let us now plot a barplot to observe the distribution of powerlifter’s Age Classes across various meets they participated.

From the above graph, we observe that the most of the participants across various meets belong to the 18-39 yrs age group. Interestingly, we can see that in 5 of these meets, there is significant participation of powerlifters belonging to the 40-64 yrs age group.


We see a an increase in participation of powerlifting over the years, Although the rise has been slow in the initial few years, it has really picked up in the last 10 years.

We can check if this is the case with both males and females -

We see that the increase in participation for females has been higher in the recent 10 years as compared to the steady increase in the participation of males.


One other interesting thing to see is if the lifting capacity of males and females have increased over the years

These graphs show that the average weight lifted has been highly fluctuating for males in the last 40 years. However, for females - the average weight lifted has been trending upwards from 1980 till 2009 and then shows a declining trend. A thing which stands out here is that, the year 2009 is when women had shown peak performance in powerlifting over the last 40 years.

4.3 Modeling

Methodology

We will subset the dataset to just include event SBD as it has weight lifts across all 3 events. The reason we do this is to use the response variable as total weight lifted and reduce skew to the maximum possible extent at the same time.

We currently have ~27k observations and we get rid of NA values which will reduce the number of observations by ~3.5K. As we are still left with ~24k observations, which are good enough to fit a regression model - we go ahead and omit the NA values.

Then, we further subset to just include variables which makes sense. The variables which we want to include which could have a possible impact on the total weight lifted are - sex, equipment, age, bodyweight and total_weight. So,we add in the variable total_weight to this subset - which is the sum of weights lifted in deadlift, bench and squat.

We will have to convert the variables sex, equipment to factors to be able to include them in the lm function.

We then choose a base model with the help of backward elimination. The scope for this procedure is between an intercept model and a model with all the variables upto the inclusion of all 2 - way interactions.

Once we performed the necessary data cleaning and have fit the model, let us define some functions to look at the model metrics such as PRESS residuals, AIC, Adj. R^2 which will help us understand the accuracy or fit of the model.

x
AIC 2.805656e+05
BIC 2.806625e+05
adjR2 7.500000e-01
RMSE 9.125500e+01
PRESS 1.970247e+08
nterms 1.100000e+01
## 
## Call:
## lm(formula = totalweight ~ sex + equipment + age + bodyweight + 
##     sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight, 
##     data = ipf_main_subset2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -619.01  -55.25   -0.49   57.85  332.56 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    153.085233   7.305437  20.955  < 2e-16 ***
## sexM                           144.220603   6.011917  23.989  < 2e-16 ***
## equipmentSingle-ply             26.555316   5.149745   5.157 2.53e-07 ***
## equipmentWraps                 -37.376752  43.460597  -0.860 0.389790    
## age                              0.526839   0.157522   3.345 0.000825 ***
## bodyweight                       3.552207   0.098484  36.069  < 2e-16 ***
## sexM:age                        -1.478453   0.102851 -14.375  < 2e-16 ***
## sexM:bodyweight                  1.196799   0.067436  17.747  < 2e-16 ***
## equipmentSingle-ply:bodyweight   0.541354   0.062790   8.622  < 2e-16 ***
## equipmentWraps:bodyweight        1.472708   0.553301   2.662 0.007781 ** 
## age:bodyweight                  -0.024062   0.002153 -11.176  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.26 on 23634 degrees of freedom
## Multiple R-squared:  0.7506, Adjusted R-squared:  0.7505 
## F-statistic:  7112 on 10 and 23634 DF,  p-value: < 2.2e-16
## [1] 91.25525

The RMSE is slightly high at ~91 and adj. R^2 is around 0.75 whuch tells that ~75% of the variability of the response being explained with these predictors. Let us look at the diagnostics to understand the problem better before we take a call on the model.

Let us begin by looking at the fitted vs studentized residual plot and the QQ plots to check for problems such as non-constant variance and non-normality.

We see that this fit has a slight problem with non-constant variance and QQ plot shows that it is not perfectly normal and has some skewness towards the left. We do not see much of an issue with serial correlation although there is a time aspect to this data, which is a good thing. We have achieved a decent fit of the model but let us take another shot to fix the non - constant variance to some extent .

We will apply box-cox transformation on the response variable to fix this and then look at the residual plot and model metrics once again.

## [1] 0.5858586

## 
## Call:
## lm(formula = totalweight ~ sex + equipment + age + bodyweight + 
##     sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight, 
##     data = ipf_main_subset2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.839  -4.055   0.212   4.395  20.765 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    35.2632718  0.5321196  66.269  < 2e-16 ***
## sexM                           14.1796247  0.4379012  32.381  < 2e-16 ***
## equipmentSingle-ply             3.4709687  0.3751015   9.253  < 2e-16 ***
## equipmentWraps                 -1.3096680  3.1656201  -0.414   0.6791    
## age                            -0.0065149  0.0114737  -0.568   0.5702    
## bodyweight                      0.2890032  0.0071735  40.288  < 2e-16 ***
## sexM:age                       -0.0928014  0.0074915 -12.388  < 2e-16 ***
## sexM:bodyweight                 0.0371200  0.0049120   7.557 4.27e-14 ***
## equipmentSingle-ply:bodyweight  0.0206552  0.0045735   4.516 6.32e-06 ***
## equipmentWraps:bodyweight       0.0892475  0.0403018   2.214   0.0268 *  
## age:bodyweight                 -0.0013089  0.0001568  -8.346  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.647 on 23634 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.7529 
## F-statistic:  7205 on 10 and 23634 DF,  p-value: < 2.2e-16
## [1] 6.646928
x
AIC 156689.179
BIC 156786.030
adjR2 0.753
RMSE 6.647
PRESS 1045334.301
nterms 11.000
GVIF Df GVIF^(1/(2*Df))
sex 23.49840 1 4.847514
equipment 181.03323 2 3.668087
age 14.06866 1 3.750822
bodyweight 14.53333 1 3.812260
sex:age 12.12624 1 3.482275
sex:bodyweight 26.22333 1 5.120872
equipment:bodyweight 231.69624 2 3.901483
age:bodyweight 26.80871 1 5.177713

We see that the box-cox transformation has fixed the violation of non-constant variance to an extent. A good outcome of this transformation is that, we managed to decrease the BIC by 44% and and RMSE by ~92% (New RMSE: 6.65). In addition, the VIF are seems to be in range and thus, we can conclude that this is indeed a good fit for the model for this phase of analysis.

Final Model

The model which explains the change in the total weight lifted by the power lifter is -

\[totalweight: sex + equipment + age + bodyweight + sex:age + sex:bodyweight + equipment:bodyweight + age:bodyweight\]

The parameters for each of these predictor variables is shown below -

X Estimate Std_Error
(Intercept) 35.26 0.53
sexM 14.18 0.44
equipmentSingle-ply 3.47 0.38
equipmentWraps -1.31 3.17
age -0.01 0.01
bodyweight 0.29 0.01
sexM:age -0.09 0.01
sexM:bodyweight 0.04 0.00
equipmentSingle-ply:bodyweight 0.02 0.00
equipmentWraps:bodyweight 0.09 0.04
age:bodyweight 0.00 0.00

According to the coefficients, we see that the parameters which have the most impact are sex and equipment, which is in line with our prior EDA.


Note - The next phase of the project will solely focus on improving the predective capability of the model.

5. Results


PROBLEM STATEMENT:

The main focus of our analysis is to help both the informed and uninformed readers understand the impact of key factors such as age, sex, bodyweight and equipment usage on the weight lifted across different kinds of powerlifts.


SOLUTION METHODOLOGY:

For better understanding of the impact of these variables, we leveraged graphics to observe the trends of total weight lifted by male and female weight lifters belonging to different age classes . We used categorical variables such as equipment , sex, age class and meet names, to add in context to the analysis and aid in understanding.

We also performed hypothesis testing to statistically prove few observations we saw from the visuals. Finally, we concluded our analysis by fitting a regression model to be able to quantify the relationship of various factors and predict the total amount of weight that can be lifted when provided with input variables such as sex, age, bodyweight and equipment.


FINDINGS & INSIGHTS:

  • Equipment and sex are the 2 most important parameters that influence the powerlifting capacity
  • Although there is a decline in the weight lifting capacity after 40 years of age, we see people over 40 also still are able to lift considerably heavier weights
  • The decline with age is more profound in squat lifting as compared to bench and deadlift categories
  • The decline in weight lifting capacity starts at 30 years for females as compared to 35 years for males. Also this decline is faster as age increases for female power lifters
  • Squat powerlifting has the maximum difference in powerllifting capacities between females and males
  • Powelifters belonging to the age group of 16 to 19 prefer to perform powerlifting with bare knees or knee sleeves
  • The Average total weight lifted using single-ply equipment is greater than that of raw weight lifting.
  • An interesting observation is that the increase in total weight lifted is not proportional to the bodyweight of the person when the bodyweight is greater than 90 kgs
  • Male powerlifters are 4 times more likely to get disqualified
    • People in the age group of 24 to 34 have higher disqualification instances as compared to other age groups
    • A lot of disqualification instances occur in “World Bench Press Championships”
  • Certain meets have significantly higher male participation as compared to females
    • World Disabled Bench Press Championship
    • World Masters Powerlifting Championship
  • In general, there is higher male particpation (> 60%) in all the bench press championships from 1990
  • The female participation has seen an significant increase in participation in “World Classic Powerlifting Championships” from 2013
  • Most of the participants in various meets belong to the age groups of 18 to 39 years - This can be considered the peak period for powerlifters
  • Participation in powerlifting has significantly increased in the last 10 years. Interstingly, female participation has also increased in the last 10 years
  • The average weight lifted by females has constantly increased from 1980 to 2009, with the highest average weight lifted being in the year 2009.


IMPLICATIONS:

This report shows how powerlifting as a sport is gaining traction. It analyzes the historic data to understand what is the best age to achieve peak performance in powerlifting and proves that people over 50 years of age can also lift considerable weights. The reader also understands the section of people who are more likely to get disqualified and that age/ bodyweight do not have any impact on the position of the person in the meet.

The reader is also equipped with a mathematic eqantion to predict the total weight that can be lifted by a powerlifter.


LIMITATIONS:

This report contains a detailed analysis on the factors impacting the performance of powerlifters and how powerlifting, as a sport has evolved over the years. However, this analysis was conducted on a smaller dataset and not on the full dataset. The additional data might give us a little more insight into the analysis. (There is an equally likely chance that it might not, but definitely worth a try).

The modeling part can be improved upon further (This will be done in the next phase of the project).

  • Improvement can be in terms of different algrithms tried to improve the predictive capability
  • It can also mean making the model more interpretable, which will make it more actionable and comprehendible for the consumer of the report